
Poster: aibek Date: Jan 24, 2014 9:20pm
Forum: forums Subject: Re: CDX digest not accurately capturing duplicates?

> Does this also mean that the CDX API's own collapse function is unreliable at removing adjacent duplicates?

According to your report, it is not the CDX Server which is at fault, but the fact that different digests are being created for the same page.

Why the same pages get different digests is an interesting question! (Assuming it is true; I will check.) Two observations, though:
(i) When I last looked into this, I could not find out how the Wayback Machine computes the digests. It does not seem to be a straightforward digest of the downloaded files; at least, it is not a straightforward computation using the 20 or so most popular algorithms (MD5, SHA-1, etc.). (See the sketch after these two observations.)
(ii) I am not sure how the Wayback Machine stores the webpages, but I assume that only one copy per digest is kept (to save on storage space). If that assumption holds, and if different digests really are being created for exactly the same content, then it is a major bug, as terabytes of space on the IA servers are being wasted.
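
Here is a minimal sketch (Python) of the kind of check I mean in (i). It assumes the digest is a plain SHA-1 of the downloaded payload; since CDX files often store digests base32-encoded rather than in hex, both encodings are printed for comparison. The URL is only a placeholder.

import base64
import hashlib
import urllib.request

url = "http://example.com/some-archived-file"  # placeholder, substitute the capture you are checking
payload = urllib.request.urlopen(url).read()

sha1 = hashlib.sha1(payload)
print("hex:   ", sha1.hexdigest())
print("base32:", base64.b32encode(sha1.digest()).decode("ascii"))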


Poster: Zarkoff Date: Jan 29, 2014 9:06am
Forum: forums Subject: Re: CDX digest not accurately capturing duplicates?

It is such a relief to hear from someone interested in the same questions, aibek.

The CDX digest is an SHA-1 hash according to this:

http://crawler.archive.org/apidocs/constant-values.html#org.archive.io.ArchiveFileConstants.CDX

Your suggestions for removing the Wayback alterations to archived pages are very useful. I wonder whether the Wayback Machine provides a means of querying for pages to be delivered with only their original attributes?

Thank you for confirming the apparent error. The CDX documentation doesn't explicitly say what the digest is, though it looks very much like a unique identifier, on the basis that the CDX server uses it to collapse adjacent duplicates and that it is documented as an SHA-1 hash.
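
For anyone else following along, a quick sketch (Python) for pulling the timestamps and digests the CDX server records for a single URL, so the adjacent duplicates can be inspected by eye. It assumes the public endpoint at web.archive.org/cdx/search/cdx with its output=json and fl parameters; the target URL is only a placeholder.

import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "url": "example.com/logo.gif",   # placeholder target
    "output": "json",
    "fl": "timestamp,digest",
})
with urllib.request.urlopen("http://web.archive.org/cdx/search/cdx?" + params) as resp:
    rows = json.load(resp)
for timestamp, digest in rows[1:]:   # first row is the field-name header
    print(timestamp, digest)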


Poster: aibek Date: Jan 30, 2014 7:27pm
Forum: forums Subject: Re: CDX digest not accurately capturing duplicates?

The digest isn’t the SHA-1 hash. For the Google logo gif linked above, the SHA-1 hash is fd852df5478eb7eb9410ee9101bb364adf487fb0, and none of the digests recorded on the CDX page matches it.

There must be ways to get the original unmodified page. (I don’t know any, though.)

Contrariwise, (i) the Wayback Machine almost certainly modifies the links on the fly, and (ii) the saved webpages are expected to differ in all imaginable and unimaginable ways. Therefore the code doing the modification must be trivial, as it has to work on every saved webpage without breaking it. I am willing to bet that it does nothing more than insert a few lines and rewrite all the URLs to carry a web.archive.org/web/TIMESTAMP prefix. As such, after comparing a few pages, once you have identified what exactly is being changed, you can be certain, for all practical purposes, that by doing the opposite (i.e., undoing those two additions) you are getting the original pages back. (In case you go this route, please post what you learn on the forum too, as it would be useful to others.)
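
To make that concrete, a rough sketch (Python) of the undo-the-modifications idea. The exact markers here are assumptions that would need to be confirmed by diffing a few real captures: I am assuming the rewriter injects a block bracketed by toolbar comments and prefixes links with web.archive.org/web/TIMESTAMP (possibly followed by a flag suffix).

import re

def strip_wayback_rewriting(html):
    # Drop the injected block, assuming it is bracketed by these comments.
    html = re.sub(
        r"<!-- BEGIN WAYBACK TOOLBAR INSERT -->.*?<!-- END WAYBACK TOOLBAR INSERT -->",
        "",
        html,
        flags=re.DOTALL,
    )
    # Undo the URL rewriting: web.archive.org/web/20140130000000xx_/http://original/...
    # becomes http://original/... again.
    html = re.sub(
        r"https?://web\.archive\.org/web/\d{14}[a-z_]*/(?=https?://)",
        "",
        html,
    )
    return html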

Also, note that you may not have to check the digests for all the saved pages. If you are willing to assume that the same digest implies the same page (even though we already know that the same page does not imply the same digest), you could proceed from the CDX collapse-based-on-digest result! (i.e., exactly where you started) That is to say, the CDX collapse result will already have removed the adjacent same-digest records; you could use that list as your master list.
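
A sketch of that shortcut (Python), assuming the public CDX endpoint and its collapse=digest parameter; the URL is a placeholder.

import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "url": "example.com/",            # placeholder target
    "output": "json",
    "collapse": "digest",             # drop adjacent records whose digest repeats
    "fl": "timestamp,digest,original",
})
with urllib.request.urlopen("http://web.archive.org/cdx/search/cdx?" + params) as resp:
    rows = json.load(resp)
master_list = [dict(zip(rows[0], row)) for row in rows[1:]]  # header row gives field names
print(len(master_list), "captures after collapsing adjacent same-digest records")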

This post was modified by aibek on 2014-01-31 03:27:04